Abstract

Starting with a set of weighted items, we want to create a generic sample of a certain size that we can 
later use to estimate the total weight of arbitrary subsets. Applied to internet traffic analysis, the items 
could be records summarizing the flows of packets streaming by a router, with, say, a hundred records 
to be sampled each hour. A subset could be flow records of a worm attack whose signature is only 
determined after sampling has taken place. The samples taken in the past allow us to trace the history of 
the attack even though the worm was unknown at the time of sampling. 

Estimation from the samples must be accurate even with heavy-tailed distributions where most of the 
weight is concentrated on a few heavy items. We want the sample to be weight sensitive, giving priority 
to heavy items. At the same time, we want sampling without replacement in order to avoid selecting 
heavy items multiple times. To fulfill these requirements we introduce priority sampling, which is the 
first weight sensitive sampling scheme without replacement that is suitable for estimating subset sums. 
Testing priority sampling on Internet traffic analysis, we found it to perform orders of magnitude better 
than previous schemes. 

Priority sampling is simple to define and implement: we consider a stream of items $i = 0, \ldots, n-1$
with weights $w_i$. For each item $i$, we generate a random number $\alpha_i \in (0, 1)$ and create a priority
$q_i = w_i/\alpha_i$. The sample $S$ consists of the $k$ highest priority items. Let $\tau$ be the $(k+1)^{\text{th}}$ highest
priority. Each sampled item $i$ in $S$ gets a weight estimate $\hat w_i = \max\{w_i, \tau\}$, while non-sampled items
get weight estimate $\hat w_i = 0$.

Magically, it turns out that the weight estimates are unbiased, that is, $E[\hat w_i] = w_i$, and by linearity
of expectation, we get unbiased estimators over any subset sum simply by adding the sampled weight
estimates from the subset. Also, we can estimate the variance of the estimates, and find, surprisingly,
that the covariance between estimates $\hat w_i$ and $\hat w_j$ of different weights is zero.

Finally, we conjecture an extremely strong near-optimality; namely that for any weight sequence, 
there exists no specialized scheme for sampling k items with unbiased weight estimators that gets smaller 
total variance than priority sampling with k+1 items. Very recently, Szegedy has settled this conjecture. 



Key words Subset sum estimation, weighted sampling, sampling without replacement. 



1 Introduction 

Starting with a set of weighted items, we want to create a generic sample of a certain size that we can later
use to estimate the total weight of arbitrary subsets. Applied to internet traffic analysis, the items could be
records summarizing the flows streaming by a router, with, say, a hundred records sampled each hour. A
subset could be flow records of a worm attack whose signature is only determined after sampling has taken
place. The samples taken in the past allow us to trace the history of the attack even though the worm was
unknown at the time of sampling.

*All authors are researchers at AT&T Labs-Research, Shannon Laboratory, 180 Park Avenue, NJ 07932, USA (email:
(duffield, lund, mthorup)@research.att.com).

Figure 1: Priority sampling of size 3 from a set of 10 weighted items.

Estimation from the samples must be accurate even with heavy-tailed distributions where most of the 
weight is concentrated on a few heavy items. We want the sample to be weight sensitive, giving priority to 
heavy items. At the same time, we want sampling without replacement in order to avoid selecting heavy 
items multiple times. To fulfill these requirements we introduce priority sampling, which is the first weight 
sensitive sampling scheme without replacement that is suitable for estimating subset sums. Testing prior- 
ity sampling on Internet traffic analysis, we found it to perform orders of magnitude better than previous 
schemes. 



1.1 Priority Sampling 

Priority sampling is a fundamental new technique to sample $k$ items from a stream of weighted items so as
to later estimate arbitrary subset sums. The scheme is illustrated in Figure 1. We consider a stream of items
with positive weights $w_0, \ldots, w_{n-1}$. For each item $i = 0, \ldots, n-1$, we generate an independent uniformly
random $\alpha_i \in (0, 1)$, and a priority $q_i = w_i/\alpha_i$. Assuming that all priorities are distinct, the priority sample
$S$ of size $k < n$ consists of the $k$ items of highest priority. An associated threshold $\tau$ is the $(k+1)^{\text{th}}$ priority.
Then $i \in S \iff q_i > \tau$. Each sampled item $i \in S$ gets a weight estimate $\hat w_i = \max\{w_i, \tau\}$. If $i \notin S$,
$\hat w_i = 0$. We will prove

$$E[\hat w_i] = w_i. \qquad (1)$$

By linearity of expectation, if we want to estimate the total weight of an arbitrary subset $I \subseteq [n] =
\{0, 1, \ldots, n-1\}$, we just sum the corresponding weight estimates in the sample, that is,

$$E\Big[\sum_{i \in I \cap S} \hat w_i\Big] = \sum_{i \in I} E[\hat w_i] = \sum_{i \in I} w_i. \qquad (2)$$
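The definition above can be sketched in a few lines of Python. This is only an illustrative, in-memory sketch and not a streaming implementation; the function names `priority_sample` and `estimate_subset_sum` are ours, and drawing $\alpha_i$ as `1 - random()` is an assumption made to avoid a division by zero.

```python
import heapq
import random


def priority_sample(weights, k, rng=random):
    """Return (estimates, tau) for a priority sample of size k < len(weights).

    estimates maps each sampled index i to max(w_i, tau); indices that are
    not sampled are absent, which corresponds to a weight estimate of 0.
    """
    if not 0 < k < len(weights):
        raise ValueError("need 0 < k < n")
    # 1 - random() lies in (0, 1]; the endpoint 1 has probability zero.
    prios = [(w / (1.0 - rng.random()), -i) for i, w in enumerate(weights)]
    # Take the k+1 highest priorities; on ties, -i favors earlier items.
    top = heapq.nlargest(k + 1, prios)
    tau = top[k][0]  # the (k+1)-th highest priority is the threshold
    estimates = {-neg_i: max(weights[-neg_i], tau) for _, neg_i in top[:k]}
    return estimates, tau


def estimate_subset_sum(estimates, selected):
    """Sum the weight estimates of sampled items i for which selected(i) holds."""
    return sum(w_hat for i, w_hat in estimates.items() if selected(i))
```

A subset is passed as a predicate on item indices, for example `estimate_subset_sum(estimates, lambda i: i % 2 == 0)`, so the subset can be chosen after the sample has been taken.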



Ties between priorities happen with probability zero, and can be resolved arbitrarily. We resolve them in
favor of earlier items. Thus we view priority $q_i$ as higher than $q_j$, denoted $q_i \succ q_j$, if either $q_i > q_j$ or
$q_i = q_j$ and $i < j$. With any such resolution of ties, priority sampling works even if some weights are zero.
Note that in the case of unit weights, $\tau$ is just the $(k+1)^{\text{th}}$ largest value $1/\alpha_j$, and then (2) with $I = [n]$
simplifies to

$$E[k\tau] = n. \qquad (3)$$

This unit case is a classic theorem in order statistics (see e.g., [2, 5]).

1.2 Selecting Subsets

We will now, with a few examples, illustrate how subsets could be selected. The basic point is that an
item, besides the weight, has other associated information, and selection of an item may be based on all its
associated information. To estimate the total weight of all selected items, we sum the weight estimates of
all sampled items that would be selected. We note that the examples below could be based on any kind of
sampling. What distinguishes priority sampling is the quality of the answers.

Internet traffic analysis Our motivating application comes from Internet traffic analysis. Internet routers
export information about transmissions of data passing through. These transmissions are called flows. A
flow could be an ftp transfer of a file, an email, or some other collection of related data moving together.
A flow record is exported with summary information such as application type, source and
destination IP addresses, and the number of packets and total bytes in the flow. We think of byte size as the
weight.

We want to sample flow records in such a way that we can answer questions like how many bytes of
traffic came from a given customer or how much traffic was generated by a certain application. Both of these
questions ask what is the total weight of a certain selection of flows. If we knew in advance of measurement
which selections were of interest, we could have a counter for each selection and increment these as flows
passed by. The challenge here is that we must not be constrained to selections known in advance of the
measurements. This would preclude exploratory studies, and would not allow a change in routine questions
to be applied retroactively to the measurements.

A killer example where the selection is not known in advance was the tracing of the Internet Slammer
worm [11]. It turned out to have a simple signature in the flow record; namely as being UDP traffic to port
1434 with a packet size of 404 bytes. Once this signature was identified, the historical development of the
worm could be determined by selecting records of flows matching this signature from a database of sampled
flow records.

We note that data streaming algorithms have been developed that generalize counters to provide answers
to a range of selections such as, for example, range queries in a few dimensions [9]. However, each such
method is still restricted to a limited type of selection to be decided in advance of the measurements.

External information in the selection In our next example, suppose Walmart saved samples of all their
sales where each record contained information such as item, location, time, and price. Based on sampled
records, they might want to ask questions like how many days of rain does it take before we get a boom in
the sale of rain gear. Knowing this would allow them to tell how long they would need to order and disperse
the gear if the weather report promised a long period of rain. Now, the weather information was not part
of the sales records, but if they had a database with historical weather information, they could look up each
sampled sales record with rain gear, and check how many days it had rained at that location before the sale.

The important lesson from this example is that selection can be based on external information not even
imagined relevant at the time when measurements are made. Such scenarios preclude any kind of streaming
algorithm based on selections of limited complexity, and show the inherent relevance of sampling
preserving full records for the purpose of arbitrary selections.
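To make the idea of selecting after sampling concrete, the following Python sketch sums the stored weight estimates of the sampled records that match a predicate chosen later, here the Slammer signature quoted above. The record type `SampledFlow`, its field names, and the helper names are hypothetical and chosen only for illustration; real flow records carry more fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SampledFlow:
    """A sampled flow record (hypothetical layout) with its weight estimate."""
    protocol: str
    dst_port: int
    packet_size: int        # bytes per packet
    weight_estimate: float  # byte estimate assigned when the flow was sampled


def is_slammer(flow):
    """Signature from the text: UDP traffic to port 1434 with 404-byte packets."""
    return (flow.protocol == "UDP"
            and flow.dst_port == 1434
            and flow.packet_size == 404)


def estimate_selected_bytes(sample, predicate):
    """Estimate the total bytes of all flows matching predicate.

    Only sampled records contribute; unsampled flows have estimate 0.
    """
    return sum(f.weight_estimate for f in sample if predicate(f))
```

Because the predicate is an ordinary function of the full record, it could equally consult external data, such as the weather database in the second example.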

1.3 Relation to classic sampling schemes 

What distinguishes priority sampling is how well it does in the common case of a heavy-tailed weight
distribution [10]. The problem with uniform sampling is that it is likely to miss out on the small proportion
of heavy items. An alternative is weighted sampling with replacement where each sample is chosen
independently, each item being selected with probability proportional to its weight. The problem is that we are
likely to get many duplicates of the heavy items, and hence provide less information on lighter items. A
variant of weighted sampling with replacement for integer weights is to divide them into unit weights. This
way we can get at most $w_i$ samples of units from item $i$. However, when weights are large compared with
the number of samples, this is still very similar to the basic weighted sampling with replacement.

The above observations suggest that we need weight-sensitive sampling without replacement. For
example, we can perform weighted sampling with replacement, but skip duplicates until we have the desired
number of samples. The book [3] mentions 50 such schemes, but none of these provides estimates of sums.
The basic problem is that the probability that a given item is included in the sample is a complicated function
of all the involved weights.

Clearly priority sampling acts without replacement. To see that it is weight-sensitive, suppose we have
an item $i$ which is $r = w_j/w_i > 1$ times smaller than an item $j$. Then the probability that $i$ gets higher
priority than $j$ is $1/(2r)$. More precisely,

$$\Pr[q_i > q_j] = \Pr[w_i/\alpha_i > w_j/\alpha_j] = \Pr[\alpha_i < \alpha_j/r] = \int_0^1 \frac{x}{r}\,dx = \frac{1}{2r}.$$

Priority sampling is thus weight-sensitive without replacement, and, as stated in (2), it provides simple
unbiased estimates of arbitrary sums. We will present tests of priority sampling on real Internet data, and see
that, for estimating subset sums, it needs orders of magnitude fewer samples than uniform sampling and
weighted sampling with replacement.

1.4 Outline of the Paper

The rest of the paper is organized as follows. In Section 2 we present the proof that priority sampling
provides unbiased estimators as stated in (1). In addition we will show how we can estimate the variance of
our subset sum estimates. This relies on the striking property of priority sampling that we establish, namely,
that with more than one sample, the covariance between different weight estimates is zero. In Section 3
we compare our new priority sampling with threshold sampling from [7], a scheme which is very closely
related but does not provide a specified number of samples. In Section 4 we present experiments with
priority sampling on real data from the Internet, demonstrating orders of magnitude gain in accuracy in
estimating weight sums, as compared with uniform sampling and weighted sampling with replacement.
In Section 5 we analyze the performance of the different sampling schemes in some simple cases in order
to gain further understanding of the experiments. In Section 6 we conjecture an extremely strong
near-optimality; namely that for any weight sequence, there exists no specialized scheme for sampling $k$ items
with unbiased weight estimators that gets smaller total variance than priority sampling with $k + 1$ items. This
conjecture was recently settled by Szegedy [12]. In Section 7 we show how we can maintain a priority
sample of size $k$ for a stream of weighted items, spending only constant time on each item as it comes by.
We finish with some concluding remarks in Section 8.

A preliminary version of parts of this work was published in a conference proceeding [6], including the
basic announcement of the priority sampling scheme. Our original proof of (1) was based on the standard
proofs for the known unit case [2, 5], but here we present a much simpler combinatorial proof and include
an entirely new analysis of variance and covariance. The experiments reported here are all new, and so is
most of the analysis of simple cases, as well as the conjecture concerning near-optimality.

2 Unbiased estimation with priority sampling 

In this section, we will show that priority sampling yields unbiased estimates of subset sums as stated in
(2). The proof is simpler and more combinatorial than the standard proofs for the known unit case [2, 5].
We will also show how to form unbiased estimators of secondary weights. Finally, we consider variance
estimation. We show that there is no covariance between the weight estimates of different items, and that
we can get unbiased estimates of the variance of any subset sum estimate.

Recall that we consider items with positive weights $w_0, \ldots, w_{n-1}$. For each item $i \in [n]$, we generate
an independent uniformly distributed random number $\alpha_i \in (0, 1)$, and a priority $q_i = w_i/\alpha_i$. Priority $q_i$ is
higher than $q_j$, denoted $q_i \succ q_j$, if either $q_i > q_j$, or $q_i = q_j$ and $i < j$. A priority sample $S$ of size $k$ consists
of the $k$ items of highest priority. The threshold $\tau$ is the $(k+1)^{\text{th}}$ highest priority. Then $i \in S \iff q_i \succ \tau$.
Each $i \in S$ gets a weight estimate $\hat w_i = \max\{w_i, \tau\}$. Also, for $i \notin S$, we define $\hat w_i = 0$. Now (1) states
that $E[\hat w_i] = w_i$.

We will prove that (1) holds for an item $i$ no matter which values the other $\alpha_j$, $j \neq i$ take. Fixing these
values, we fix all the other priorities $q_j$, $j \neq i$. Let $\tau'$ be the $k^{\text{th}}$ highest of these other priorities. We can
now view $\tau'$ as a fixed number. More formally, our analysis is conditioned on the event $A(\tau')$ of $\tau'$ being
the $k^{\text{th}}$ highest among the priorities $q_j$, $j \neq i$, and we will prove

$$E[\hat w_i \mid A(\tau')] = w_i. \qquad (4)$$

Proving (4) for any value of $\tau'$ implies (1). The essential observation is as follows.

Lemma 1 Conditioned on $A(\tau')$, item $i$ is picked with probability $\min\{1, w_i/\tau'\}$, and if picked, $\tau = \tau'$.

Proof We pick $\alpha_i \in (0, 1)$ uniformly at random, thus fixing $q_i = w_i/\alpha_i$. If $q_i \prec \tau'$, there are at least $k$
priorities higher than $q_i$, so $i \notin S$. Conversely, if $q_i \succ \tau'$, then $\tau'$ becomes the $(k+1)^{\text{th}}$ priority among all
priorities, so $\tau' = \tau$, and then $i \in S$. Finally,

$$\Pr[i \in S \mid A(\tau')] = \Pr[q_i \succ \tau'] = \Pr[\alpha_i < w_i/\tau'] = \min\{1, w_i/\tau'\}.$$

From Lemma 1 we get

$$E[\hat w_i \mid A(\tau')] = \Pr[i \in S \mid A(\tau')] \times E[\hat w_i \mid i \in S \wedge A(\tau')]
= \min\{1, w_i/\tau'\} \times \max\{w_i, \tau'\} = w_i.$$

The last equality follows by observing that both the min and the max take their first, respectively their
second value, depending on whether or not $w_i > \tau'$. This completes the proof of (4), hence of (1).
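As a quick numerical illustration of the two cases (the numbers are ours and chosen arbitrarily), take $\tau' = 8$ and compare a light item with $w_i = 2$ to a heavy item with $w_i = 10$:

```latex
% Light item, w_i = 2 < \tau' = 8: sampled with probability 1/4, estimate 8.
E[\hat w_i \mid A(\tau')] = \min\{1, 2/8\} \times \max\{2, 8\} = \tfrac{1}{4} \times 8 = 2 = w_i
% Heavy item, w_i = 10 > \tau' = 8: always sampled, estimate equals its weight.
E[\hat w_i \mid A(\tau')] = \min\{1, 10/8\} \times \max\{10, 8\} = 1 \times 10 = 10 = w_i
```

In the first case the inflated estimate exactly compensates for the low sampling probability; in the second there is no randomness left in the estimate.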

2.1 Zero weight items and sampling it all 

We note here that priority sampling, as defined above, works even in the presence of zero weights. First we
note that $w_i = 0 \iff q_i = w_i/\alpha_i = 0$ while $w_i > 0 \iff q_i = w_i/\alpha_i > w_i > 0$. It follows that
zero weight items can only be sampled if all positive weight items have been sampled. Moreover, if we do
sample a zero weight item $i$, we have $\tau \le q_i = w_i = 0$, so $\tau = 0$, and then $\hat w_j = w_j$ for all items $j$. Having
noted that zero weight items do not cause problems, we will mostly ignore them.

Above we have assumed $k < n$, but we note a natural view of a priority sample of everything, that is,
with $k = n$. We define an $(n+1)^{\text{th}}$ priority $\tau = q_n = 0$, as if we had an extra zero weight $w_n = 0$. Then
$q_i \succ \tau = q_n$ for all $i \in [n]$, so all items get sampled. Moreover $\hat w_i = \max\{w_i, \tau\} = w_i$, so the weight
estimate is equal to the original weight.

2.2 Secondary variables 

Suppose that each item $i$ has a secondary variable $x_i$. We can then use (1) to give unbiased estimators of
corresponding secondary subset sums. More precisely, we set $\hat x_i = \hat w_i x_i/w_i$. That is, $\hat x_i = \max\{w_i, \tau\}\,x_i/w_i =
\max\{1, \tau/w_i\}\,x_i$ if $i$ is sampled; 0 otherwise. Then (1) implies $E[\hat x_i] = x_i$.

An application could be to deal with negative and positive weights $x_i$. We could define the priority
weights as their absolute values, that is, $w_i = |x_i|$, and use these non-negative weights in the priority
sample.

Another application could be if we had several different variables for each item. Instead of making an 
independent priority sample for each variable, we could construct a compromise weight. For example, for 
each item, the weight could be a weighted sum of all the associated variables. 
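The secondary estimator defined at the start of this section is a one-line rescaling. The sketch below assumes a positive priority weight `w` (so the division is defined) and a threshold `tau` taken from an existing priority sample; the function name `secondary_estimate` is ours.

```python
def secondary_estimate(w, x, tau, sampled):
    """Estimate of secondary variable x for an item with priority weight w > 0.

    Implements x_hat = max(1, tau / w) * x for sampled items and 0 otherwise.
    """
    if not sampled:
        return 0.0
    return max(1.0, tau / w) * x
```

With signed values and `w = abs(x)`, as in the first application above, the estimate keeps the sign of `x` and has magnitude `max(abs(x), tau)`.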

2.3 Variance estimation for a single item 

We now provide a simple variance estimator

$$\hat v_i = \begin{cases} \tau \max\{0, \tau - w_i\} & \text{if } i \in S \\ 0 & \text{if } i \notin S \end{cases}$$

and show that it is unbiased, that is,

$$E[\hat v_i] = \mathrm{Var}[\hat w_i]. \qquad (5)$$

As in the proof of (1), we define $A(\tau')$ to be the event that $\tau'$ is the $k^{\text{th}}$ highest among the priorities $q_j$, $j \neq i$.
We will prove

$$E[\hat v_i \mid A(\tau')] = E[\hat w_i^2 \mid A(\tau')] - w_i^2. \qquad (6)$$

From Lemma 1 we get

$$E[\hat v_i \mid A(\tau')] = \Pr[i \in S \mid A(\tau')] \times E[\hat v_i \mid i \in S \wedge A(\tau')]
= \min\{1, w_i/\tau'\} \times \tau' \max\{0, \tau' - w_i\}
= \max\{0, w_i\tau' - w_i^2\}.$$

On the other hand,

$$E[\hat w_i^2 \mid A(\tau')] = \Pr[i \in S \mid A(\tau')] \times E[\hat w_i^2 \mid i \in S \wedge A(\tau')]
= \min\{1, w_i/\tau'\} \times \max\{w_i, \tau'\}^2
= \max\{w_i^2, w_i\tau'\}.$$

This establishes (6) and hence (5).



2.4 Covariance 

Assuming that we sample more than one item, we will show that the covariance between our weight
estimates is zero, that is, for $k > 1$ and $i \neq j$,

$$E[\hat w_i \hat w_j] = w_i w_j. \qquad (7)$$

If $k = 1$, we have $E[\hat w_i \hat w_j] = 0$ since we cannot sample both $i$ and $j$.

Note that (7) is somewhat counter-intuitive in that if we sample $i$ then this reduces the probability that
we also sample $j$. However, the assumption that $i$ is sampled affects the threshold $\tau$ and thereby the weight
estimate $\hat w_j$ and somehow, the different effects cancel out.

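We will prove (7) via the following common generalization of (1) and (7), holding for any $I \subseteq [n]$, $|I| \le k$:

$$E\Big[\prod_{i \in I} \hat w_i\Big] = \prod_{i \in I} w_i. \qquad (8)$$

If $|I| > k$, we have $E\big[\prod_{i \in I} \hat w_i\big] = 0$ since at most $k$ items are sampled with $\hat w_i > 0$.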

The proof of (8) generalizes that of (1). Inductively on the size of $I$, we will prove that (8) holds no
matter what values all the other $\alpha_j$, $j \notin I$ take. The equality is trivially true in the base case where $I = \emptyset$
and the products equal one.

Thus, for all $j \notin I$, fix all $\alpha_j \in (0, 1)$ and priorities $q_j = w_j/\alpha_j$. Fix $\tau''$ to be the $(k - |I| + 1)^{\text{th}}$ highest
of these priorities $q_j$, $j \notin I$. This priority exists because $|I| \le k < n$. Next for $i \in I$, we pick $\alpha_i \in (0, 1)$
and set $q_i = w_i/\alpha_i$. We can now have at most $(k - |I|) + |I| = k$ priorities above $\tau''$, so $\tau''$ is at least as big as
our new threshold $\tau$.

Consider the case that $I$ has a weight $w_h > \tau''$. Fix $\alpha_h \in (0, 1)$ arbitrarily. Then $q_h > w_h > \tau'' \ge \tau$,
so item $h$ is sampled with $\hat w_h = \max\{w_h, \tau\} = w_h$. Hence $E\big[\prod_{i \in I} \hat w_i\big] = w_h\,E\big[\prod_{i \in I \setminus \{h\}} \hat w_i\big]$. We have
now fixed all $\alpha_j$, $j \notin I \setminus \{h\}$, and by induction, $E\big[\prod_{i \in I \setminus \{h\}} \hat w_i\big] = \prod_{i \in I \setminus \{h\}} w_i$. This completes the
proof of (8) in the case that $w_h > \tau''$.

Next consider the case that all weights from $I$ are smaller than $\tau''$. Let $q_\ell$ be the lowest priority from $I$.
If $q_\ell \prec \tau''$, then there are at least $(k - |I| + 1) + (|I| - 1) = k$ priorities higher than $q_\ell$, so $\ell \notin S$, and
$\hat w_\ell = 0 = \prod_{i \in I} \hat w_i$. Thus, if $q_\ell \prec \tau''$, there is no contribution to $E\big[\prod_{i \in I} \hat w_i\big]$.

Conversely, if $q_\ell \succ \tau''$, then all priorities from $I$ are bigger than $\tau''$. In this case there are exactly
$(k - |I|) + |I| = k$ priorities higher than $\tau''$, so $\tau''$ becomes our threshold $\tau$. Then each $i \in I$ is sampled.
Since $w_i < \tau'' = \tau$, we get $\hat w_i = \max\{w_i, \tau\} = \tau''$. Hence $\prod_{i \in I} \hat w_i = (\tau'')^{|I|}$. Since no weight in $I$ is
higher than $\tau''$, the probability that all their priorities are bigger is $\prod_{i \in I} (w_i/\tau'')$. Thus, the contribution to
$E\big[\prod_{i \in I} \hat w_i\big]$ is $(\tau'')^{|I|} \prod_{i \in I} (w_i/\tau'') = \prod_{i \in I} w_i$. This completes the proof of (8) in the remaining case where
all weights $w_h$, $h \in I$, are smaller than $\tau''$.



2.5 Variance estimation over any subset 

We can now use our variance estimator from Section 2.3 to estimate the variance over any subset. By (5)
and (7) we get an unbiased estimator of the variance of any subset sum estimate simply by summing the
variance estimators from the subset, that is, if $k > 1$, for any subset $I \subseteq [n]$,

$$\mathrm{Var}\Big[\sum_{i \in S \cap I} \hat w_i\Big] = E\Big[\sum_{i \in S \cap I} \hat v_i\Big]. \qquad (9)$$

In fact, (9) also holds if $k = 1$, but this is because $\mathrm{Var}\big[\sum_{i \in S \cap I} \hat w_i\big] = \infty$ for any non-empty subset $I$. We
shall return to this point later.
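The variance estimate for a subset sum can be sketched in Python as follows. The function names are ours, `sample` is assumed to be the collection of sampled item indices, and `tau` is the threshold of that same priority sample.

```python
def variance_estimate(w, tau, sampled):
    """Per-item variance estimate: tau * max(0, tau - w) if sampled, else 0."""
    return tau * max(0.0, tau - w) if sampled else 0.0


def subset_variance_estimate(weights, sample, tau, selected):
    """Sum the per-item variance estimates over sampled items in the subset.

    weights: original weights indexed by item; sample: sampled indices;
    selected: predicate on item indices defining the subset.
    """
    return sum(variance_estimate(weights[i], tau, True)
               for i in sample if selected(i))
```

Sampled items with weight at least `tau` contribute nothing, which matches the fact that their weight estimate equals their true weight.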



3 Comparison with a fixed threshold scheme

It is instructive to compare our priority sampling scheme with threshold sampling from [7]. In that approach
each item is sampled independently, so we do not control the exact number of samples. Before sampling,
a fixed threshold $\tau^{\mathrm{THR}}$ is chosen. An item $i$ is sampled with certainty if $w_i > \tau^{\mathrm{THR}}$, and with probability
$w_i/\tau^{\mathrm{THR}}$ if $w_i \le \tau^{\mathrm{THR}}$. We denote the set of selected items by $S^{\mathrm{THR}}$.

To see the relation to priority sampling, note that threshold sampling can be expressed in a manner
similar to priority sampling as follows: generate a random number $\alpha_i \in (0, 1]$ and sample item $i$ if $q_i =
w_i/\alpha_i > \tau^{\mathrm{THR}}$. As in our new scheme, the sampled items get weight estimate $\hat w_i^{\mathrm{THR}} = \max\{w_i, \tau^{\mathrm{THR}}\}$
whereas $\hat w_i^{\mathrm{THR}} = 0$ if $i \notin S^{\mathrm{THR}}$. Thus the only difference between priority sampling and the threshold
sampling from [7] is in the choice of the threshold. In threshold sampling, the threshold is fixed independent
of the random choices. Thus the threshold determines only the expected number of independent samples, not
the actual random number of samples. By contrast, in priority sampling, the threshold is picked depending
on the random choices so as to get a fixed number of dependent samples. We note that it is far from obvious
that such a threshold could be chosen without violating the unbiasedness of estimation.
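Expressed with priorities, threshold sampling differs from priority sampling only in where the threshold comes from. A minimal Python sketch under that formulation follows; the function name `threshold_sample` is ours, and the threshold `tau_thr` is assumed to be supplied by the caller.

```python
import random


def threshold_sample(weights, tau_thr, rng=random):
    """Sample each item independently against a fixed threshold tau_thr.

    Returns a dict mapping each sampled index i to max(w_i, tau_thr);
    the number of sampled items is random.
    """
    estimates = {}
    for i, w in enumerate(weights):
        alpha = 1.0 - rng.random()  # uniform in (0, 1]
        if w / alpha > tau_thr:     # priority q_i = w_i / alpha_i
            estimates[i] = max(w, tau_thr)
    return estimates
```

Calling it repeatedly on the same weights returns samples of varying size, whereas a priority sample always has exactly $k$ items.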



3.1 Optimality of the fixed threshold scheme 

In [7], the fixed threshold approach to independent sampling is proved to give an optimal trade-off between
variance and sampling rate. More precisely, for an item $i$ with weight $w_i$, we have to decide on a sampling
probability $p_i$. If $i$ is not picked, the weight estimate is zero, that is, $\hat w_i(p_i) = 0$. To get an unbiased estimator, if item
$i$ is picked, it should have weight estimate $\hat w_i(p_i) = w_i/p_i$. Then $E[\hat w_i(p_i)] = w_i$. Generally, we want to
sample few items, yet keep the variance low. This motivates an objective of the form

minimize $p_i + \beta\,\mathrm{Var}[\hat w_i(p_i)]$
application       bytes  % of traffic  # flows  % flows  max flow size  average flow size  min flow size
all          4265677642        100.00    85680   100.00     3372865057              49786             28
ftp          3394832734         79.58      727     0.84     3372865057            4669646             40
web            80120429          1.87     7787     9.08        3139196              10289             40
dns             4083277          0.09    40767    47.58         621812                100             40

Table 1: Statistics on ten minutes of flows from an Internet gateway router showing traffic from some
different applications. Note that nearly half the flows belong to applications not mentioned.



Here

$\mathrm{Var}[\hat w_i(p_i)] = E[(\hat w_i(p_i))^2] - w_i^2$ where $E[(\hat w_i(p_i))^2] = p_i (w_i/p_i)^2 = w_i^2/p_i$.

Thus we want to

minimize $p_i + \beta\, w_i^2/p_i$

For $p_i \in [0, 1]$, the solution is to set $p_i = \min\{1, \sqrt{\beta}\,w_i\}$. With $\beta = 1/(\tau^{\mathrm{THR}})^2$ this is equivalent to
the fixed threshold scheme. That is, for any choice of $\tau^{\mathrm{THR}}$, the fixed threshold scheme picks the $p_i =
\min\{1, w_i/\tau^{\mathrm{THR}}\}$ so as to

minimize $p_i + (1/\tau^{\mathrm{THR}})^2\,\mathrm{Var}[\hat w_i(p_i)]$. (10)
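The closed form for $p_i$ follows from a short calculus argument, sketched here for completeness using only the symbols defined above:

```latex
% Stationary point of the per-item objective for p_i > 0:
\frac{d}{dp_i}\Big(p_i + \beta\,\frac{w_i^2}{p_i}\Big)
  = 1 - \beta\,\frac{w_i^2}{p_i^2} = 0
  \quad\Longrightarrow\quad p_i = \sqrt{\beta}\,w_i .
% The second derivative 2\beta w_i^2 / p_i^3 is positive, so the objective is convex.
```

Since the objective is convex for $p_i > 0$, this stationary point is the minimum whenever $\sqrt{\beta}\,w_i \le 1$. Otherwise the objective is decreasing on $(0, 1]$ and the minimum is attained at $p_i = 1$, which gives $p_i = \min\{1, \sqrt{\beta}\,w_i\}$.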
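Summing over the whole stream of items, we

minimize $\sum_{i \in [n]} \big(p_i + (1/\tau^{\mathrm{THR}})^2\,\mathrm{Var}[\hat w_i(p_i)]\big) = E\big[|S^{\mathrm{THR}}|\big] + (1/\tau^{\mathrm{THR}})^2\,\mathrm{Var}\Big[\sum_{i \in [n]} \hat w_i(p_i)\Big]$ (11)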



We now constrain ourselves to getting an expected number $k$ of samples. To minimize the total variance, we
just have to identify $\tau^{\mathrm{THR}}$ such that

$$\sum_{i \in [n]} p_i = \sum_{i \in [n]} \min\{1, w_i/\tau^{\mathrm{THR}}\} = k.$$

With this value of $\tau^{\mathrm{THR}}$, the fixed threshold scheme from [7] minimizes the total variance subject to
unbiased estimation and an expected number $k$ of samples. Any other assignments of individual sampling
probabilities $p_i$ will do worse. The quality of our new scheme is largely inherited from this fixed threshold
scheme, but we have some extra variability due to the variability of the threshold.
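One way to identify the threshold satisfying the constraint above is to sort the weights and try each possible number of items that are sampled with certainty. The Python sketch below takes that approach; it is our illustration rather than the procedure used in [7], the function name is hypothetical, and it assumes positive weights and $0 < k < n$.

```python
def threshold_for_expected_size(weights, k):
    """Return tau with sum(min(1, w / tau) for w in weights) == k.

    Assumes all weights are positive and 0 < k < len(weights).
    """
    n = len(weights)
    if not 0 < k < n:
        raise ValueError("need 0 < k < n")
    if any(w <= 0 for w in weights):
        raise ValueError("weights must be positive")
    ws = sorted(weights, reverse=True)
    # suffix[t] = total weight of all items except the t heaviest ones.
    suffix = [0.0] * (n + 1)
    for t in range(n - 1, -1, -1):
        suffix[t] = suffix[t + 1] + ws[t]
    for t in range(k):
        # Suppose the t heaviest items are sampled with probability 1; the
        # remaining items must then contribute k - t expected samples.
        tau = suffix[t] / (k - t)
        if ws[t] <= tau:  # every remaining item is at most tau: consistent
            return tau
    raise AssertionError("unreachable for positive weights and k < n")
```

For example, with weights `[10, 1, 1, 1, 1]` and `k = 2`, the heavy item is taken with certainty and the four unit items share one expected sample, so each is taken with probability 1/4.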



4 Experiments 

We tested priority sampling on 10 minutes of flows from an Internet gateway router. For increasing sample
sizes, we wanted to check our ability to estimate subset sums where each subset was defined by flows
originated by certain applications such as FTP and web traffic. This illustrates how priority sampling can
be used today in a backbone network. The basic flow statistics for the different applications are presented in
Table 1. We compared the following sampling schemes:

PRI our new priority sampling.

U-R uniform sampling without replacement.

W+R weighted sampling with replacement.

THR the fixed threshold sampling from [7] as described in Section 3.

For weighted sampling with replacement, we note that there are two alternative ways of deriving weight
estimates. More precisely, we have a list $S^{W+R}$ of $k$ samples. Each sample $S^{W+R}[j]$ is independent and
equals item $i$ with probability $w_i/W$ where $W$ is the total weight. The simplest unbiased estimator of $w_i$
counts duplicates, estimating $w_i$ as $|\{j \mid S^{W+R}[j] = i\}|\,W/k$. However, we get a smaller variance if we just
consider whether item $i$ is present in $S^{W+R}$. The probability of this event is

$$p_i^{W+R} = 1 - (1 - w_i/W)^k,$$

and then we get the unbiased weight estimator: $\hat w_i^{W+R} = w_i/p_i^{W+R}$ if $i \in S^{W+R}$, and $\hat w_i^{W+R} = 0$ otherwise.

We now describe the setup of the experiments; the interpretation of the results follows in the next
sections. In Figure 2 we compare the estimation accuracy of the different sampling schemes on the data
summarized in Table 1. For each sampling scheme, we progressively increased the size of the sample by selecting
more items from the data, estimating total weight in each application subset of the total flows for each
sample size.

In Figure 3 the same samples are used to estimate an 8 x 8 = 64 entry traffic matrix. Each matrix
element corresponds to the traffic between an input and output interface on a router. We estimate the total
bytes for each matrix element. Our accuracy measure is the average over all elements of the relative estimation
error. For priority sampling (PRI), uniform sampling without replacement (U-R), and weighted sampling
with replacement (W+R), the number $k$ of samples is exact.

In threshold sampling (THR), the threshold determines only the expected number of samples. For each
item $i$, we used the same priority $q_i = w_i/\alpha_i$ for priority sampling and threshold sampling. In priority
sampling, we picked exactly $k$ samples using the $(k+1)^{\text{th}}$ priority $\tau$ as a threshold. In threshold sampling,
we computed the threshold $\tau^{\mathrm{THR}}$ giving an expected number $k$ of samples. Thus, for a given $k$, the only
difference is in the choice of threshold.

Finally, Figure 4 shows the number of distinct samples as a percentage of the target. For priority sampling
(PRI) and uniform sampling (U-R) we have no replacement, so we get exactly $k$ distinct samples, that is,
100%. With weighted sampling with replacement (W+R) the duplicates mean that we get fewer distinct
samples. Finally, with threshold sampling (THR), all samples are distinct, but we only have an expected
number $k$ of samples, hence the deviation from the target $k$.

4.1 Discussion

The quality of a sampling scheme is measured by the number of samples it takes before the estimates
converge towards the true value.

4.1.1 Sampling exactly k samples

First we compare our priority sampling (PRI) scheme with the other schemes providing an exact number
$k$ of samples, that is, with uniform sampling without replacement (U-R) and weighted sampling with
replacement (W+R). In Figures 2 and 3 we see that priority sampling provides very substantial gains in
accuracy over the other schemes.

Figure 2: Estimating traffic from different applications with different sampling strategies. The red line
shows the true traffic from each application.

Figure 3: Estimating a traffic matrix with different sampling strategies. We divide the total error over all
entries with the total traffic.

When comparing the curves, there are two points to consider. One is how many samples it takes before 
we get one from a given application. This is the point at which we get our first non-zero estimates. Second 
we consider how quickly we converge after this point. 

Number of samples needed to hit an application With uniform sampling, the number of samples ex- 
pected before we get one from a given application is roughly the total number of flows divided by the 
number of application flows. In that regard, ftp traffic is clearly the worst. 

With weighted sampling with replacement, the expected number is roughly the total traffic divided by
the application traffic. The worst application here is dns traffic, which was the best application for uniform
sampling.

Priority sampling is like weighted sampling with replacement but it avoids making duplicates of
dominant items. If the dominant items are outside the application, we waste at most one sample on each.
The impact is clear for dns traffic where we get the first sample about 30 times earlier with priority sampling
than we did with weighted sampling with replacement. A more direct illustration of the problem is
found in Figure 4 where we see how the fraction of distinct samples drops in weighted sampling with
replacement.

Convergence after first hitting an application After we have started getting samples from an application, 
uniform sampling may still have problems with convergence. This typically occurs if the weight distribution 
within the application is heavy-tailed. Once again, ftp traffic is the worst application, this time because it has 
a dominant flow with more than 99% of its traffic. Until this flow is sampled, we expect to underestimate. 
If it is sampled early, we will hugely overestimate, although this is unlikely. The typical heavy-tail behavior 
is that the estimate grows as we catch up with more and more dominant items. We see this phenomenon both
for ftp traffic and for all traffic combined.

With weighted sampling with replacement and with priority sampling, we get quicker convergence as
soon as we start having samples from an application. Neither scheme has any problems with skewed weight
distributions within the applications. For example, we see that weighted sampling with replacement
starts slower than uniform on web traffic, yet it ends up converging faster. Similarly, priority sampling starts
slower than uniform on dns traffic, yet it converges faster.



The traffic matrix Figure 3 shows the average relative error over 8 x 8 = 64 entries. We note first the
poor performance of uniform sampling. In fact, it is only luck that the error with uniform sampling remains below
100%. This is because we miss the dominant items and get under-estimates that can never be off by more than
100%. We could instead have gotten a dominant item early, leading to a huge over-estimate by far more than
100%.


Comparing priority sampling with weighted sampling with replacement, the faster convergence of
priority sampling is very clear. For example, priority sampling gets down to around a 1% error with about
150 samples whereas weighted sampling with replacement needs about 3000 samples, and weighted
sampling falls further behind with smaller errors because it gets more and more duplicates.

4.2 Priority sampling versus threshold sampling 

A reason to believe that priority sampling works very well for a fixed number of samples is its similarity with
threshold sampling, which for an expected number of independent samples minimizes the total variance. In
Figures 2 and 3 we see that indeed priority sampling (PRI) and threshold sampling (THR) are very close,
neither having a systematic advantage. Hence, in our experiment, we see no loss in quality going from an
expected number of samples (THR) to an exact number of samples (PRI). The variation in the actual number
of samples with THR is shown in Figure 4.

As we shall see below, there are certain boundary phenomena that make priority sampling perform
significantly worse than threshold sampling.

5 Analytic comparison of variance in some simple cases 

In this section, we will compare the different sampling schemes on some simple cases where we can analyze 
the variance, so as to gain some intuition for what is going on. 

Generalizing notation from Section 3, if $w$ is a weight and $p \in [0, 1]$ a sampling probability, we let $w(p)$
denote the random variable that is $w/p$ with probability $p$; 0 otherwise. Then

$$\mathrm{E}[w(p)] = p\,(w/p) = w$$

$$\mathrm{E}[(w(p))^2] = p\,(w/p)^2 = w^2/p$$

$$\mathrm{Var}[w(p)] = w^2/p - w^2 = w^2\,\frac{1-p}{p}$$

It is also convenient to define the function

$$v(w, \tau) = w \max\{0, \tau - w\}$$

Then, with fixed threshold $\tau^{THR}$, the variance for item $i$ is

$$\mathrm{Var}[\hat w_i^{THR}] = \mathrm{Var}[w_i(\min\{1, w_i/\tau^{THR}\})]
  = w_i^2\,(1/\min\{1, w_i/\tau^{THR}\} - 1)
  = w_i \max\{0, \tau^{THR} - w_i\}
  = v(w_i, \tau^{THR})$$

With our new priority sampling, the threshold changes, and the variance of item $i$ is

$$\mathrm{Var}[\hat w_i] = \int_{\tau'=0}^{\infty} f(\tau')\, v(w_i, \tau')\, d\tau' \qquad (12)$$

where $f(\tau')$ is the probability density function for $\tau'$ to be the $k$th threshold amongst the items $j \neq i$. With
$\tau'$ thus defined, by our earlier lemma, item $i$ is picked if $q_i = w_i/\alpha_i > \tau'$, with $\hat w_i = \max\{w_i, \tau'\}$; otherwise
$\hat w_i = 0$. This imitates the fixed threshold scheme with $\tau^{THR} = \tau'$. Thus (12) follows from the previous
calculation with a fixed threshold.

Sometimes it is easier with a more direct calculation. Summing over all $j \neq i$, we integrate over choices
of $\alpha_j$, multiply with the probability that $q_j = w_j/\alpha_j$ is the $k$th highest priority from $[n] \setminus \{i\}$, and multiply
with the variance $v(w_i, q_j)$. That is,

$$\mathrm{Var}[\hat w_i] = \sum_{j \in [n]\setminus\{i\}} \int_{\alpha_j=0}^{1}
  \Pr\big[\,|\{h \in [n]\setminus\{i,j\} \mid q_h > q_j\}| = k-1\,\big]\; v(w_i, q_j)\; d\alpha_j \qquad (13)$$
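As a sanity check on the fixed-threshold calculation, the following minimal sketch verifies with exact rational arithmetic that the variance of $w(p)$ with $p = \min\{1, w/\tau\}$ coincides with $v(w, \tau)$. The function names are illustrative only and do not come from the paper.

```python
from fractions import Fraction


def variance_w_of_p(w, p):
    """Variance of the estimate that is w/p with probability p and 0 otherwise."""
    second_moment = p * (w / p) ** 2   # E[w(p)^2] = w^2 / p
    return second_moment - w ** 2      # subtract E[w(p)]^2 = w^2


def v(w, tau):
    """The helper function v(w, tau) = w * max{0, tau - w}."""
    return w * max(Fraction(0), tau - w)


def threshold_variance(w, tau):
    """Variance under a fixed threshold tau, where p = min{1, w/tau}."""
    p = min(Fraction(1), w / tau)
    return variance_w_of_p(w, p)


# Exact check on a small grid of rational weights and thresholds.
for w in (Fraction(1, 2), Fraction(1), Fraction(3), Fraction(7)):
    for tau in (Fraction(1, 3), Fraction(2), Fraction(5), Fraction(10)):
        assert threshold_variance(w, tau) == v(w, tau)
```

Items with weight at least the threshold are always picked and contribute no variance, which is why both sides vanish when $w \ge \tau$.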



5.1 Infinite variance with single priority sample 

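We will show that if we only make a single priority sample with $k = 1$, then the variance of any weight
estimate is infinite. The proof is based on (13). We assume $i = 0$. For a lower-bound, in the sum, we
only need to consider one other item $j = 1$. Also, when integrating over $\alpha_1$, we only consider very small
values of $\alpha_1$. More precisely, define $\varepsilon = w_1/(2W)$ where $W$ is the sum of all weights. If $\alpha_1 < \varepsilon$, we have
$q_1 = w_1/\alpha_1 > 2W$, and then

$$\begin{aligned}
\Pr\big[\,|\{h \in [n]\setminus\{i,j\} \mid q_h > q_j\}| = k-1\,\big]
  &= \Pr\big[\,|\{h \in \{2,\ldots,n-1\} \mid q_h > q_1\}| = 0\,\big] \\
  &\ge 1 - \sum_{h \in \{2,\ldots,n-1\}} \Pr[q_h > 2W] \\
  &= 1 - \sum_{h \in \{2,\ldots,n-1\}} (w_h/2W) \\
  &> 1/2
\end{aligned}$$

Moreover, since $w_1/\alpha_1 > 2W \ge 2w_0$,

$$v(w_i, q_j) = v(w_0, q_1) = w_0 \max\{0, w_1/\alpha_1 - w_0\} > w_0 w_1/(2\alpha_1)$$

Thus, by (13), we have

$$\mathrm{Var}[\hat w_0] > \int_{0}^{\varepsilon} \frac{1}{2} \cdot \frac{w_0 w_1}{2\alpha_1}\, d\alpha_1 = \infty$$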

We note that none of the other sampling schemes considered can get infinite variance. 
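To spell out why the lower-bound integral for a single priority sample diverges, here is a short worked step. It only uses the integrand $\frac{1}{2}\cdot\frac{w_0 w_1}{2\alpha_1}$ on $(0, \varepsilon)$ and introduces an auxiliary cutoff $\delta$ that is not part of the original argument:

```latex
\int_{\delta}^{\varepsilon} \frac{1}{2}\cdot\frac{w_0 w_1}{2\alpha_1}\,d\alpha_1
  \;=\; \frac{w_0 w_1}{4}\,\ln\frac{\varepsilon}{\delta}
  \;\longrightarrow\; \infty
  \qquad\text{as } \delta \to 0 .
```

The divergence is only logarithmic, which is consistent with the boundedness shown next once the integrand gains an extra factor $\alpha_j$ for $k \ge 2$.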

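Next, we argue that the variance is bounded if we make at least two priority samples. Again, we focus
on the variance for item $i = 0$. Also, it suffices to show that the integral in (13) is finite for each value of $j$,
that is, we want to show that

$$V_{i,j} = \int_{0}^{1} \Pr\big[\,|\{h \in [n]\setminus\{i,j\} \mid q_h > q_j\}| = k-1\,\big]\; v(w_i, q_j)\; d\alpha_j$$

is bounded. Now, for $k \ge 2$,

$$\begin{aligned}
\Pr\big[\,|\{h \in [n]\setminus\{i,j\} \mid q_h > q_j\}| = k-1\,\big]
  &\le \Pr\big[\,|\{h \in [n]\setminus\{i,j\} \mid q_h > q_j\}| \ge 1\,\big] \\
  &\le \sum_{h \in [n]\setminus\{i,j\}} \Pr[q_h > q_j] \\
  &= \sum_{h \in [n]\setminus\{i,j\}} \min\{1, w_h \alpha_j / w_j\} \\
  &\le \sum_{h \in [n]\setminus\{i,j\}} w_h \alpha_j / w_j \\
  &< W \alpha_j / w_j
\end{aligned}$$

Moreover,

$$v(w_i, q_j) = w_i \max\{0, w_j/\alpha_j - w_i\} < w_i w_j / \alpha_j,$$

so we get that

$$V_{i,j} < \int_{0}^{1} (W \alpha_j / w_j) \cdot (w_i w_j / \alpha_j)\; d\alpha_j = \int_{0}^{1} W w_i\; d\alpha_j = W w_i.$$

Hence

$$\mathrm{Var}[\hat w_i] = \sum_{j \in [n]\setminus\{i\}} V_{i,j} < n W w_i,$$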

so indeed the variance is bounded. Since the covariance is zero, it also follows that the variances of the
weight estimates of subsets are bounded. Thus we have proved

Proposition 2 If we make a single priority sample, then all weight estimates have infinite variance. With
more than one priority sample, all weight estimates have finite variance.

By contrast, with all the other sampling schemes, the variances are finite as soon as we make at
least one sample.

5.2 Unit weights 

We will now study identical unit weights, focusing on the first item $i = 0$. We will compute the exact
variance for each of the sampling schemes considered.

Uniform sampling without replacement For uniform sampling without replacement, item 0 is picked
with probability $p_0^{U-R} = k/n$, hence with

$$\mathrm{Var}[\hat w_0^{U-R}] = \frac{1 - p_0^{U-R}}{p_0^{U-R}} = \frac{n-k}{k}$$

Weighted sampling with replacement For weighted sampling with replacement, item 0 is picked with
probability $p_0^{W+R} = 1 - (1 - 1/n)^k$, hence with

$$\mathrm{Var}[\hat w_0^{W+R}] = \frac{1 - p_0^{W+R}}{p_0^{W+R}} = \frac{(1 - 1/n)^k}{1 - (1 - 1/n)^k}$$

For $k \ll n$, the variance approaches $(n-k)/k$ from above. However, for $k = n$, the variance approaches
$1/(e(1 - e^{-1})) = 0.58\ldots$

Fixed threshold In the fixed threshold scheme from fl\, we set $\tau^{THR} = n/k$. Then

$$\mathrm{Var}[\hat w_0^{THR}] = v(w_0, \tau^{THR}) = \max\{0, \tau^{THR} - w_0\} = \frac{n-k}{k} \qquad (14)$$

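Priority sampling For priority sampling, we will evaluate (13) exactly. We use that

$$\Pr[q_h > q_1] = \Pr[\alpha_h < \alpha_1] = \alpha_1$$

and

$$v(w_0, q_1) = w_0 \max\{0, q_1 - w_0\} = 1/\alpha_1 - 1$$

so, with $B(m, \alpha)$ denoting a binomially distributed variable with $m$ trials of success probability $\alpha$,

$$\begin{aligned}
\mathrm{Var}[\hat w_0]
  &= \sum_{j=1}^{n-1} \int_{0}^{1} \Pr\big[\,|\{h \in [n]\setminus\{0,j\} \mid q_h > q_j\}| = k-1\,\big]\; v(w_0, q_j)\; d\alpha_j \\
  &= (n-1) \int_{\alpha_1=0}^{1} \Pr\big[\,|\{h \in \{2,\ldots,n-1\} \mid q_h > q_1\}| = k-1\,\big]\; v(w_0, q_1)\; d\alpha_1 \\
  &= (n-1) \int_{\alpha_1=0}^{1} \Pr[B(n-2, \alpha_1) = k-1]\,(1/\alpha_1 - 1)\; d\alpha_1 \\
  &= (n-1) \int_{\alpha_1=0}^{1} \binom{n-2}{k-1}\, \alpha_1^{k-2} (1-\alpha_1)^{n-k}\; d\alpha_1 \\
  &= \frac{n-k}{k-1}
\end{aligned}$$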

Discussion For unit weights, uniform sampling without replacement and threshold sampling get the same
variance on single item weight estimates; namely $(n-k)/k$. When $k$ is not too small, priority sampling
gets nearly the same variance; namely $(n-k)/(k-1)$. Weighted sampling with replacement starts doing well, but gets
worse and worse as $k$ grows. In particular, for any $k \ge n$, it has positive variance while all the other schemes
have zero variance since they have no replacement.
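The unit-weight variances can be cross-checked numerically. The sketch below is illustrative only (the function names are not from the paper): it evaluates the closed forms for the three simpler schemes and computes the priority sampling variance by integrating $(n-1)\binom{n-2}{k-1}\alpha^{k-2}(1-\alpha)^{n-k}$ term by term with exact fractions, assuming $2 \le k \le n$.

```python
from fractions import Fraction
from math import comb


def var_uniform(n, k):
    """Uniform sampling without replacement: (1 - p)/p with p = k/n."""
    return Fraction(n - k, k)


def var_weighted_with_replacement(n, k):
    """Weighted sampling with replacement: p = 1 - (1 - 1/n)^k."""
    miss = Fraction(n - 1, n) ** k
    return miss / (1 - miss)


def var_threshold(n, k):
    """Fixed threshold tau = n/k on unit weights: max{0, tau - 1}."""
    return max(Fraction(0), Fraction(n, k) - 1)


def var_priority(n, k):
    """Priority sampling, by exact integration over alpha in (0, 1)."""
    assert 2 <= k <= n
    # Expand (1 - a)^(n-k) binomially; the integral of a^(k-2+j) is 1/(k-1+j).
    integral = sum(
        Fraction((-1) ** j * comb(n - k, j), k - 1 + j)
        for j in range(n - k + 1)
    )
    return (n - 1) * comb(n - 2, k - 1) * integral


# The integral reproduces the closed form (n - k)/(k - 1) on a small grid.
for n in range(3, 12):
    for k in range(2, n + 1):
        assert var_priority(n, k) == Fraction(n - k, k - 1)
        assert var_threshold(n, k) == var_uniform(n, k)
        assert var_weighted_with_replacement(n, k) >= var_uniform(n, k)
```

For example, `var_priority(10, 4)` and `var_uniform(10, 4)` can be compared directly to see the small penalty priority sampling pays for replacing $k$ by $k-1$ in the denominator.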



5.3 Large and small weights 

In this section we illustrate what happens when different weights are involved. We consider the case where
we have $\ell$ large weights of weight $N$ and $n$ unit weights. The large weights are first, that is, $w_0 = \cdots =
w_{\ell-1} = N$ while $w_\ell = \cdots = w_{\ell+n-1} = 1$. We let $W = \ell N + n$ denote the total weight. We view $\ell$, $n$,
and $N$ as unbounded. We assume $\ell \ll n \ll \sqrt{N}$ and that $k \ll n$. These assumptions will help simplify
the analysis. We will use $w_0$ as a representative for the large items and $w_n$ as a representative for the small
items. The resulting variances from the different sampling schemes will be accumulated in Table 2.

Uniform sampling without replacement For uniform sampling without replacement, the large item is
picked with probability $p_0^{U-R} = k/(n + \ell)$, hence with

$$\mathrm{Var}[\hat w_0^{U-R}] = N^2\,\frac{1 - p_0^{U-R}}{p_0^{U-R}} = N^2\,\frac{n + \ell - k}{k} \approx N^2 n/k$$

For small item $n$, we have the same sampling probability, $p_n^{U-R} = k/(n + \ell)$, so we get

$$\mathrm{Var}[\hat w_n^{U-R}] = \frac{1 - p_n^{U-R}}{p_n^{U-R}} \approx n/k$$

Weighted sampling with replacement For weighted sampling with replacement, the large item is
picked with probability $p_0^{W+R} = 1 - (1 - N/W)^k \approx 1 - e^{-k/\ell}$, hence with

$$\mathrm{Var}[\hat w_0^{W+R}] = N^2\,\frac{1 - p_0^{W+R}}{p_0^{W+R}} \approx N^2\,\frac{e^{-k/\ell}}{1 - e^{-k/\ell}} = \frac{N^2}{e^{k/\ell} - 1}$$

In particular, this is $\Theta(N^2)$ for $k = \Theta(\ell)$.$^1$ Yet it saves a factor $n/\ell$ over uniform sampling without replacement
in the case of large weights.

For weighted sampling with replacement, the small item $i = n$ is picked with probability $p_n^{W+R} =
1 - (1 - 1/W)^k \approx k/W \approx k/(\ell N) \ll 1$, hence with

$$\mathrm{Var}[\hat w_n^{W+R}] = \frac{1 - p_n^{W+R}}{p_n^{W+R}} \approx N\ell/k$$

Fixed threshold For the fixed threshold scheme, if $k \le \ell$, we set $\tau^{THR} = W/k > N$. Then for heavy
item 0,

$$\mathrm{Var}[\hat w_0^{THR}] = v(w_0, \tau^{THR}) = N(W/k - N) \approx N^2\,\frac{\ell - k}{k}$$

while for a light item $n$, it is

$$\mathrm{Var}[\hat w_n^{THR}] = v(w_n, \tau^{THR}) = (W/k - 1) \approx N\ell/k$$

On the other hand, for $k > \ell$, we pick a threshold below $N$; namely the threshold $\tau^{THR} = n/(k - \ell)$ solving
$\ell + n/\tau^{THR} = k$. Then for heavy item 0,

$$\mathrm{Var}[\hat w_0^{THR}] = v(w_0, \tau^{THR}) = 0$$

while for a light item $n$, it is

$$\mathrm{Var}[\hat w_n^{THR}] = v(w_n, \tau^{THR}) = n/(k - \ell) - 1 \approx n/(k - \ell)$$

$^1$ $f(n) = \Theta(g(n))$ iff there exist $a, b > 0$ such that $a f(n) \le g(n) \le b f(n)$ for all sufficiently large $n$.

Priority sampling First we consider big item 0. To compute the variance, we sum over the events $A(m)$
that we have $m$ small items with priorities bigger than $N$, multiplying the probability of $A(m)$ with

$$\mathrm{E}[\hat w_0^2 \mid A(m)] - w_0^2 = \mathrm{E}[\hat w_0^2 \mid A(m)] - \mathrm{E}[\hat w_0 \mid A(m)]^2 = \mathrm{Var}[\hat w_0 \mid A(m)].$$

Trivially, $\Pr[A(m)] = \Pr[B(n, 1/N) = m]$. Consider a small item $i$. Conditioned on having a big priority
$q_i > N$, item $i$ acts like a heavy item. Conversely, conditioned on having a small priority $q_i < N$, item $i$
has no impact on the weight estimate of a heavy item. Thus, in the event $A(m)$, the variance of item 0 is as
if we had $\ell + m$ heavy items and no small items. If $\ell + m \le k$, the threshold is at most $N$, and then there is
no variance. If $\ell + m > k$, the analysis from the uniform unit case shows that

$$\mathrm{Var}[\hat w_0 \mid A(m)] = N^2\,\frac{\ell + m - k}{k - 1}$$

Thus

$$\mathrm{Var}[\hat w_0] = \sum_{m=0}^{n} \Pr[A(m)]\, \mathrm{Var}[\hat w_0 \mid A(m)]
  = \sum_{m=\max\{0,\,k-\ell+1\}}^{n} \Pr[B(n, 1/N) = m]\; N^2\,\frac{\ell + m - k}{k - 1}$$

Since $N \gg n^2$, the first term dominates, so with $m = \max\{0, k - \ell + 1\}$, we get that

$$\mathrm{Var}[\hat w_0] \approx \Pr[B(n, 1/N) = m]\; N^2\,\frac{\ell + m - k}{k - 1}$$

If $k < \ell$, we get $m = 0$, and then

$$\mathrm{Var}[\hat w_0] \approx \Pr[B(n, 1/N) = 0]\; N^2\,\frac{\ell - k}{k - 1} \approx N^2\,\frac{\ell - k}{k - 1}$$

If $k \ge \ell$, we get $m = k - \ell + 1$, and then

$$\begin{aligned}
\mathrm{Var}[\hat w_0] &\approx \Pr[B(n, 1/N) = k - \ell + 1]\; N^2/k \\
  &< (n/N)^{k-\ell+1}\, N^2/k \\
  &= nN\,(n/N)^{k-\ell}/k
\end{aligned}$$

We now consider the light item $n$. We are going to prove that $\mathrm{Var}[\hat w_n] \approx N\ell/(k - 1)$ if $k \le \ell$,
$\mathrm{Var}[\hat w_n] \approx n \ln N$ if $k = \ell + 1$, and $\mathrm{Var}[\hat w_n] \approx n/(k - \ell - 1)$ if $k > \ell + 1$.

We consider two different contributions to the variance depending on whether the threshold $\tau$ is greater
than $N$. If $\tau > N$, we further distinguish depending on whether $q_n > N$. If $\tau > N$ and $q_n < N$, then
$\hat w_n = 0$, so the squared deviation from $w_n = 1$ is 1. The probability of this event is

$$\Pr[q_n < N] \sum_{m=\max\{0,\,k-\ell+1\}}^{n-1} \Pr[B(n-1, 1/N) = m] \approx \Pr[B(n-1, 1/N) = \max\{0, k - \ell + 1\}]$$

If $k < \ell$, this is a variance contribution close to 1, and if $k \ge \ell$, the variance contribution is bounded
by $\Pr[B(n-1, 1/N) = k - \ell + 1] < (n/N)^{k-\ell+1}$. In either case, this contribution to the variance is not
significant.

Next consider the case that $\tau > N$ and $q_n > N$. The probability that $q_n > N$ is $1/N$. Let $A'(m)$ denote
the event that we have $m$ small items $i \neq n$ with $q_i > N$. Conditioned on $q_n > N$, we have $\tau > N$ if
and only if $m \ge k - \ell$. In this case, the variance contribution is $\mathrm{E}[\hat w_n^2] - 1$. However, $\hat w_n$ behaves like the
weight estimate of a heavy item among $\ell + m + 1$ heavy items, so

$$\mathrm{E}[\hat w_n^2 \mid q_n > N \wedge A'(m)] = N^2\,\frac{\ell + m}{k - 1}$$

Thus we get a variance contribution of

$$\frac{1}{N} \sum_{m=\max\{0,\,k-\ell\}}^{n-1} \Pr[B(n-1, 1/N) = m] \left( N^2\,\frac{\ell + m}{k - 1} - 1 \right)
  \approx \frac{1}{N} \cdot \Pr[B(n-1, 1/N) = \max\{0, k - \ell\}]\; N^2\,\frac{\ell + \max\{0, k - \ell\}}{k - 1}$$

For $k \le \ell$, this is approximately $N\ell/(k - 1)$, which dominates the variance. For $k > \ell$, this variance
contribution is approximately $N \Pr[B(n-1, 1/N) = k - \ell] < n\,(n/N)^{k-\ell-1}$, which is insignificant.

We now consider the case where $\tau < N$. This requires $k > \ell$ and is like the unit case, except that we
only sample $k' = k - \ell$ items. Hence we can apply the integral from the unit weight case, but with the
restriction that $\alpha > 1/N$. We then get a variance contribution of

$$(n-1) \int_{\alpha=1/N}^{1} \Pr[B(n-2, \alpha) = k' - 1]\,(1/\alpha - 1)\; d\alpha$$

For $k' \ge 2$, the impact of starting the integral at $1/N$ is insignificant, so we get a variance contribution
which is approximately $(n - k')/(k' - 1) \approx n/(k - \ell - 1)$. For $k' = 1$, we get a variance contribution of

$$(n-1) \int_{\alpha=1/N}^{1} \binom{n-2}{k'-1}\, \alpha^{k'-2} (1-\alpha)^{n-k'}\; d\alpha
  < n \int_{\alpha=1/N}^{1} \alpha^{-1}\; d\alpha = n \ln N.$$

This completes the analysis of priority sampling for large and small weights. A comparison of all the
sampling schemes is summarized in Table 2.



Discussion With reference to Table 2, the problem with uniform sampling is that it does a terrible job on
the large weights, performing about $n/\ell$ times worse than the other schemes. On the other hand, it gives the
best performance on the small items. However, the advantage over threshold and priority sampling becomes
insignificant when $k \gg \ell$. This illustrates that if the number of dominant items is small compared with the
number of samples, then threshold and priority sampling do very well even on the small items.

The problem in weighted sampling with replacement is that it does poorly compared with threshold and
uniform sampling when the number of samples exceeds the number of dominant items. This is for both large
and small items, illustrating the problem with duplicates.

Finally, comparing threshold and priority sampling, we see that priority sampling has positive variance
for $k > \ell$ whereas threshold sampling has no variance. However, this variance of priority sampling is very
small compared to a weight of $N$, so it is a case where priority sampling is doing very well anyway. It is
more interesting to see what happens with the small items. The major differences are in the two boundary
cases when $k = 1$ and when $k = \ell + 1$. The former case has infinite variance as discussed previously. For
$k = \ell + 1$, we see that priority sampling does worse by a factor of $\ln N$. This is only by the logarithm of the
ratio of the large weight over the small weight, and it is only for the special boundary case when $k = \ell + 1$
that we have such a big difference. It is therefore not surprising that this kind of difference did not show up
in any of our experiments. Also, we note that in this special case, weighted sampling with replacement is
performing even much worse; namely by a factor of $N/n$.

                    1 < k < ℓ            k = ℓ            k = ℓ+1                k > ℓ+1
large item   U-R    N^2 n/k              N^2 n/k          N^2 n/k                N^2 n/k
             W+R    N^2/(e^(k/ℓ) - 1)    N^2/(e - 1)      N^2/(e^(k/ℓ) - 1)      N^2/(e^(k/ℓ) - 1)
             THR    N^2 (ℓ-k)/k          nN/ℓ             0                      0
             PRI    N^2 (ℓ-k)/(k-1)      nN/ℓ             < nN (n/N)^(k-ℓ)/k     < nN (n/N)^(k-ℓ)/k
small item   U-R    n/k                  n/k              n/k                    n/k
             W+R    Nℓ/k                 Nℓ/k             Nℓ/k                   Nℓ/k
             THR    Nℓ/k                 Nℓ/k             n/(k-ℓ)                n/(k-ℓ)
             PRI    Nℓ/(k-1)             Nℓ/(k-1)         n ln N                 n/(k-ℓ-1)

Table 2: Overview of variance with k samples, in the case of ℓ large items of size N.

Thus, in our analysis, priority sampling performs very well compared with the other schemes for sampling
exactly $k$ items, and it is only in rather singular cases that it performs worse than threshold sampling.

Tailoring a better scheme for $k = \ell + 1$ samples In our large-small weight example, for $k$ not too small,
priority sampling is only beaten by threshold sampling, which, however, does not sample exactly $k$ items.
In particular, priority sampling is outperformed for $k = \ell + 1$. We will now construct a sampling scheme for
this particular case which samples exactly $k$ items and gets the same performance as threshold sampling for
any $k > \ell$. Like threshold sampling, the tailored scheme picks all the $\ell$ large items. Moreover, it picks $k - \ell$
items uniformly without replacement among the small unit items. From our study of the unit case, we know
that uniform sampling gets the same variance as that of threshold sampling on the small items. Thus each item
gets the same variance with our tailored scheme as with threshold sampling, but our tailored scheme samples
exactly $k$ items.

6 Conjectured near-optimality of priority sampling 

Recall from Section 3 that threshold sampling minimizes the total variance when we do independent sampling
getting an expected number of $k$ samples. We would have liked to provide a somewhat similar result
for priority sampling among schemes sampling exactly $k$ items, but we know that this is not the case. For
unit items, uniform sampling without replacement got an item variance of $(n-k)/k$ while priority sampling got
an item variance of $(n-k)/(k-1)$. Also, for our large-small example, we found a specialized scheme outperforming
priority sampling when $k > \ell$.

We formalize our intuition as the conjecture that if priority sampling is allowed just one extra sample, it
beats any specialized sampling scheme on any sequence of weights. More precisely,

Conjecture 1 For any weight sequence $w_0, \ldots, w_{n-1}$ and positive integer $k < n$, there is no tailored scheme
for picking a sample $S \subseteq [n]$ of up to $k$ items $i$ with unbiased weight estimates $\hat w_i$ (that is, $\hat w_i = 0$ if $i \notin S$
and $\mathrm{E}[\hat w_i] = w_i$ for all $i \in [n]$) so that the total variance $\sum_{i \in [n]} \mathrm{Var}[\hat w_i]$ is smaller than with a priority
sample of size $k + 1$.

The conjecture also covers tailored schemes where the same item is picked multiple times, or where fewer than
$k$ samples may be picked, as in weighted sampling with replacement. If we have multiple weight estimates
for an item $i$, we add them up to a single weight estimate $\hat w_i$, and if the sample has fewer than $k$ items, we
add extra items $j$ with $\hat w_j = 0$. Thus the tailored scheme is transformed into one that always picks exactly
$k$ distinct items.

In fact, Conjecture 1 is equivalent to the following conjecture relating priority sampling to threshold
sampling:

Conjecture 2 For any weight sequence $w_0, \ldots, w_{n-1}$ and positive integer $k$, threshold sampling with an
expected number of $k$ samples gets a total variance which is no smaller than with a priority sample of size
$k + 1$.

One consequence of Conjecture 1 is that if we only have resources for a certain number $k$ of samples, then
we are much better off using priority sampling than using threshold sampling for a small enough expected
number of samples, e.g., $k - 2\sqrt{k}$, that the probability of getting more than $k$ samples is small.
To see that Conjectures 1 and 2 are equivalent, we prove

Proposition 3 For any weight sequence $w_0, \ldots, w_{n-1}$ and positive integer $k$, there is no scheme for picking
a sample $S \subseteq [n]$ of $k$ items $i$ with unbiased weight estimates $\hat w_i$ so that the total variance is smaller than
with threshold sampling with an expected number of $k$ samples. In fact, given the weight sequence, we can
construct an optimal scheme for picking $k$ items getting exactly the same total variance as that of threshold
sampling.



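Proof Let $\Phi$ be a scheme for picking a sample $S \subseteq [n]$ of $k$ items $i$ with unbiased weight estimates $\hat w_i^{\Phi}$. We
then consider the corresponding scheme $\mathcal{I}(\Phi)$ for independent sampling. More precisely, $\mathcal{I}(\Phi)$ considers
each item $i$ independently, picking $i$ with the same probability $p_i$ as does $\Phi$, and with the same probability
distribution on the weight estimate $\hat w_i^{\mathcal{I}(\Phi)}$ as $\Phi$ induces on its weight estimate $\hat w_i^{\Phi}$. Then
$\mathrm{E}[\hat w_i^{\mathcal{I}(\Phi)}] = \mathrm{E}[\hat w_i^{\Phi}] = w_i$ and $\mathrm{Var}[\hat w_i^{\mathcal{I}(\Phi)}] = \mathrm{Var}[\hat w_i^{\Phi}]$. Moreover, by linearity of expectation,
the expected number of samples with $\mathcal{I}(\Phi)$ is

$$\sum_{i \in [n]} p_i = k.$$

Thus the independent sampling scheme $\mathcal{I}(\Phi)$ has unbiased estimators like $\Phi$, an expected number of $k$
samples, and the same item variances as $\Phi$.

Now, suppose for some item $i$ that $\mathcal{I}(\Phi)$ has more than one possible non-zero weight estimate $\hat w_i^{\mathcal{I}(\Phi)}$.
We then make an improved sampling scheme $\mathcal{I}^*(\Phi)$ which picks item $i$ with the same probability $p_i$ as $\Phi$
and $\mathcal{I}(\Phi)$, but which then always uses the same weight estimate $\hat w_i^{\mathcal{I}^*(\Phi)} = w_i/p_i$. Then

$$\mathrm{E}[\hat w_i^{\mathcal{I}^*(\Phi)}] = w_i \quad\text{and}\quad
  \mathrm{Var}[\hat w_i^{\mathcal{I}^*(\Phi)}] \le \mathrm{Var}[\hat w_i^{\mathcal{I}(\Phi)}] = \mathrm{Var}[\hat w_i^{\Phi}],$$

with strict inequality if the non-zero estimates of item $i$ actually vary, that is, if
$\mathrm{Var}[\hat w_i^{\Phi} \mid i \in S] > 0$. For example, we have strict inequality if $\Phi$ is a priority sampling scheme
with $k < n$.

The optimized scheme $\mathcal{I}^*(\Phi)$ has the same format as the schemes considered in Section 5, and we
know that threshold sampling minimizes the total variance among these schemes. Consequently, with THR
denoting threshold sampling of an expected number of $k$ items, it follows that

$$\sum_{i \in [n]} \mathrm{Var}[\hat w_i^{THR}] \le \sum_{i \in [n]} \mathrm{Var}[\hat w_i^{\mathcal{I}^*(\Phi)}] \le \sum_{i \in [n]} \mathrm{Var}[\hat w_i^{\Phi}].$$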

We will now go the other way. Our starting point is an independent sampling scheme $\Phi$ that picks each
item $i$ independently with probability $p_i$, and which picks an expected integer number $k$ of samples. If item $i$
is picked, it gets weight estimate $\hat w_i^{\Phi} = w_i/p_i$; 0 otherwise. For example, $\Phi$ could be our threshold sampling
scheme THR. We will now define a corresponding sampling scheme $\mathcal{E}(\Phi)$ picking exactly $k$ samples, and
getting the same variance for each item.

We are going to describe an iterative procedure defining $\mathcal{E}(\Phi)$. Initially, set $r_i = 1 - p_i$ for all $i \in [n]$.
We are going to define different events, and as we do so, reduce $p_i$ and $r_i$ so as to reflect the remaining
probability that item $i$ is picked or not picked, respectively. After each iteration, we have a remaining total
probability $P = p_0 + r_0 = p_1 + r_1 = \cdots = p_{n-1} + r_{n-1}$. In each event we pick exactly $k$ items, and
since we start with an expected number of $k$ items, we will always have an expected number of $k$ items in
the remainder, that is, $(\sum_{i \in [n]} p_i)/P = k$.

Consider an item $i$. If $p_i = 0$, item $i$ is not picked in any remaining event. Conversely, if $r_i = 0$, item
$i$ is forced to be picked in all remaining events. If $p_i > 0$ and $r_i > 0$, item $i$ is "unsettled". If there are no
unsettled items, we have a final event, doing what has to be done: since $(\sum_{i \in [n]} p_i)/P = k$ and since each
$p_i$ is either $P$ or 0, there are exactly $k$ items $i$ with $p_i = P$, and these are all picked.

Assume that we have some unsettled items $i$. Let $n'$ be the number of unsettled items. Also, let $k'$ be
the number of items to be picked among the unsettled items, that is, from $k$ we subtract the forced items that have
to be picked because $r_i = 0$. In our next event $A$, we want to pick the forced items and $k'$ items uniformly
from the unsettled items. Then each unsettled item $i$ is picked with probability $k'/n'$. Hence, if $P_A$ is the probability of the
event $A$, then for each unsettled item $i$, we will reduce $p_i$ by $P_A k'/n'$, and $r_i$ by $P_A (n' - k')/n'$. For a forced
item we reduce $p_i$ by $P_A$, and for an item with $p_i = 0$ we reduce $r_i$ by $P_A$, so that $p_i + r_i$ drops by $P_A$ for every item.

We choose $P_A$ maximally, subject to the condition that no $p_i$ or $r_i$ may turn negative. Then

$$P_A = \min\{\,p_i n'/k',\; r_i n'/(n' - k') \mid i \in [n],\; p_i > 0,\; r_i > 0\,\}$$

With this choice of $P_A$, the event $A$ will settle at least one item, so it will take at most $n$ iterations to define
the sampling scheme $\mathcal{E}(\Phi)$.

By definition, for each item $i$, we get the same distribution of weight estimates with $\mathcal{E}(\Phi)$ as with $\Phi$,
hence also the same variances. In particular it follows that $\mathcal{E}(THR)$ has the same total variance as does
threshold sampling.
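The iterative construction can be sketched in a few lines. The code below is an illustrative reading of the procedure, not the authors' implementation: the names `decompose` and `marginals` are hypothetical, and exact fractions are assumed so that "settled" items are detected by equality with zero. Each returned event picks all forced items plus a uniformly random $k'$-subset of the unsettled items.

```python
from fractions import Fraction


def decompose(probs, k):
    """Split independent inclusion probabilities into events picking exactly k items.

    Returns a list of (event_probability, forced_items, unsettled_items, k_prime).
    """
    p = [Fraction(x) for x in probs]
    assert sum(p) == k, "the expected sample size must be the integer k"
    r = [1 - x for x in p]
    remaining = Fraction(1)           # total probability P not yet assigned
    events = []
    while remaining > 0:
        forced = [i for i in range(len(p)) if r[i] == 0]
        unsettled = [i for i in range(len(p)) if p[i] > 0 and r[i] > 0]
        excluded = [i for i in range(len(p)) if p[i] == 0]
        if not unsettled:             # final event: pick the k forced items
            events.append((remaining, forced, [], 0))
            break
        n1, k1 = len(unsettled), k - len(forced)
        # Largest event probability that keeps every p_i and r_i non-negative.
        pa = min(min(p[i] * n1 / k1, r[i] * n1 / (n1 - k1)) for i in unsettled)
        events.append((pa, forced, unsettled, k1))
        for i in forced:
            p[i] -= pa
        for i in excluded:
            r[i] -= pa
        for i in unsettled:
            p[i] -= pa * k1 / n1
            r[i] -= pa * (n1 - k1) / n1
        remaining -= pa
    return events


def marginals(events, n):
    """Inclusion probability of each item under the mixture of events."""
    total = [Fraction(0)] * n
    for pa, forced, unsettled, k1 in events:
        for i in forced:
            total[i] += pa
        for i in unsettled:
            total[i] += pa * k1 / len(unsettled)
    return total
```

A quick way to use it is to call `decompose` on a probability vector summing to an integer and confirm with `marginals` that every item keeps its original inclusion probability, which is what guarantees unchanged item variances when the estimate $w_i/p_i$ is reused.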

Note that when $k < n$, the total variance of $\mathcal{E}(THR)$ is always smaller than that of priority sampling
since

$$\sum_{i \in [n]} \mathrm{Var}[\hat w_i^{\mathcal{E}(THR)}]
  = \sum_{i \in [n]} \mathrm{Var}[\hat w_i^{THR}]
  \le \sum_{i \in [n]} \mathrm{Var}[\hat w_i^{\mathcal{I}^*(PRI)}]
  < \sum_{i \in [n]} \mathrm{Var}[\hat w_i^{PRI}].$$

For example, this was how we improved priority sampling in the special case at the end of the previous
section. ■

As evidence for Conjecture 2, we note from Section 5 that it is true for the unit case. Also, it can be proved
to hold for the large-small example using a more refined analysis. Finally, we note that the conjecture
conforms nicely with the closeness of priority sampling and threshold sampling in the experiments from
Section 4. Also, it appears that we can prove an asymptotic version; namely that
$\sum_{i \in [n]} \mathrm{Var}[\hat w_i^{PRI[k+1]}] < a \sum_{i \in [n]} \mathrm{Var}[\hat w_i^{THR[k]}]$,
where $a$ is a large enough constant, PRI[$k+1$] is priority sampling of $k + 1$ items,
and THR[$k$] is threshold sampling of an expected number of $k$ items. However, this is complicated, and
beyond the scope of the current paper.

Very recently, Szegedy [12] has settled Conjecture 2. By the above equivalence, his proof also implies
Conjecture 1. Thus priority sampling is variance optimal modulo one extra sample.



7 Sampling from a stream 

In this section, we will discuss how we can maintain a sample of size k for a stream of items i = 0, 1, 2, ... 
with weights Wi. 



7.1 Reservoir sampling 

In so-called reservoir sampling, at any point in time, we want to have a sample of size $k$ from the items seen
so far. Thus, if we have seen items $0, \ldots, n - 1$, we should have a sample $S \subseteq [n]$. The individual samples
are denoted $S[0], \ldots, S[k - 1]$.



Uniform sampling without replacement This case was studied by Vitter [14]. Let $S^{U-R} \subseteq [n]$ be the
current sample. While $n \le k$, we have $S^{U-R}[i] = i$ for $i = 0, \ldots, n - 1$. When item $n \ge k$ arrives, we pick a
random number $j \in [n + 1]$. If $j < k$, we set $S^{U-R}[j] := n$. Finally, we set $n := n + 1$. All this takes
constant time for each item.

We note that the weight estimates are only maintained implicitly via $n$. If $j \in S^{U-R}$, then $\hat w_j = \frac{n}{k} w_j$
where $n$ is the current number of items.
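The uniform reservoir step described above can be sketched as follows. This is a minimal illustration with hypothetical names (`reservoir_uniform`, `rng`), assuming the stream is any Python iterable; it returns the sample together with the item count $n$ from which the implicit estimates $\frac{n}{k} w_j$ can be formed.

```python
import random


def reservoir_uniform(stream, k, rng=random):
    """Maintain a uniform sample of size k without replacement over a stream."""
    sample = []
    n = 0                              # number of items seen so far
    for item in stream:
        if n < k:
            sample.append(item)        # the first k items fill the reservoir
        else:
            j = rng.randrange(n + 1)   # uniform in {0, ..., n}
            if j < k:
                sample[j] = item       # item n replaces slot j
        n += 1
    return sample, n


sample, n = reservoir_uniform(range(1000), 10, random.Random(1))
scale = n / len(sample)                # each sampled weight is scaled by n/k
```

Each arriving item costs one random number and at most one assignment, matching the constant time per item noted in the text.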



Weighted sampling with replacement This case was studied by Chaudhuri et al. [4]. Besides maintaining
a sample $S^{W+R} \subseteq [n]$, we maintain the total current weight $W = \sum_{i \in [n]} w_i$. When item $n$ arrives, for
$j = 0, \ldots, k - 1$, we pick a random number $\alpha \in (0, 1)$. If $\alpha < \frac{w_n}{W + w_n}$, we set $S^{W+R}[j] := n$. When done
with all samples, we set $W := W + w_n$. Note that if we had $w_n \ge W$, we would expect to change at
least half the samples, so for exponentially increasing weight sequences, we spend $\Theta(k)$ time on each item.
However, in [4], it is falsely claimed that their algorithm spends constant time on each item.

Using the current value of W, we can compute the weight estimates of the sampled items as described 
in Section m 



Priority sampling Priority sampling is trivially implemented using a standard priority queue |T||. Recall
that for each item $i$, we generate a random number $\alpha_i \in (0, 1)$ and a priority $q_i = w_i/\alpha_i$. A priority queue
$Q$ maintains the $k + 1$ items of highest priority. The $k$ highest form our sample $S$, and the smallest in $Q$
is our threshold $\tau$.

It is convenient to start filling our priority queue $Q$ with $k + 1$ dummy items with weight and priority 0.
When a new item arrives we simply place it in $Q$. Next we remove the item from $Q$ with smallest priority.
With a standard comparison based priority queue, we spend $O(\log k)$ time on each item, but exploiting a floating
point representation, we can get down to $O(\log \log k)$ time per item fTT| (this counts the number of floating
point operations, but is independent of the precision of floating point numbers). This is substantially better
than the $\Theta(k)$ time we spend on weighted sampling with replacement, but a bit worse than the constant time
spent on uniform sampling without replacement. We shall later show how to get down to constant time if
we relax the notion of reservoir sampling a bit.
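A comparison-based version of this procedure can be sketched with Python's `heapq` module. The sketch is illustrative: `priority_sample` is a hypothetical name, positive weights are assumed, the dummy items are replaced by an explicit check for short streams, and $\alpha$ is drawn from $(0, 1]$ rather than $(0, 1)$ to avoid division by zero.

```python
import heapq
import random


def priority_sample(weights, k, rng=random):
    """Return ({item index: weight estimate}, threshold) for a priority sample of size k."""
    heap = []                                  # min-heap holding at most k + 1 entries
    for i, w in enumerate(weights):
        alpha = 1.0 - rng.random()             # uniform in (0, 1]
        entry = (w / alpha, i, w)              # (priority, index, weight)
        if len(heap) < k + 1:
            heapq.heappush(heap, entry)
        else:
            heapq.heappushpop(heap, entry)     # insert, then drop the smallest priority
    if len(heap) <= k:                         # fewer than k + 1 items seen: threshold 0
        tau, kept = 0.0, heap
    else:
        tau, kept = heap[0][0], heap[1:]       # heap[0] is the (k+1)st highest priority
    return {i: max(w, tau) for _, i, w in kept}, tau


estimates, tau = priority_sample([5.0, 1.0, 3.0, 100.0, 2.0, 8.0], 3, random.Random(7))
subset_total = sum(est for i, est in estimates.items() if i in {0, 3, 5})
```

The last line shows the intended use: the weight of an arbitrary subset is estimated by summing the estimates of the sampled items that fall in it.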

Reservoir sampling for threshold sampling In fj], the threshold $\tau^{THR}$ was determined before items
were considered. The threshold was adapted to the traffic to get a desired amount of samples, yet bursts in
traffic lead to bursts in the sample. Here, as a new contribution to threshold sampling, we present a reservoir
version of threshold sampling which at any time maintains a sample $S^{THR}$ of expected size $k$.

As items stream by, we generate priorities as in priority sampling. At any point, $n$ is the number of items
seen so far. We maintain a threshold $\tau^{THR}$ that would give an expected number $k$ of items, that is,

$$\sum_{i \in [n]} \min\{1, w_i/\tau^{THR}\} = k \qquad (15)$$

Also, we maintain the corresponding threshold sample, that is, the set $S^{THR}$ of items $i \in [n]$ with priority $q_i > \tau^{THR}$.

The sample $S^{THR}$ is stored in a priority queue. When a new item $n$ arrives it is first added to $S^{THR}$. Next
we have to increase $\tau^{THR}$ so as to satisfy (15) with $n' = n + 1$. Finally, we remove all the items from $S^{THR}$
with priorities lower than $\tau^{THR}$. Thanks to the priority queue, each such item is extracted in $O(\log k)$ time.

We still have to tell how we compute the threshold. Together with the sample, we store the set $L$ of all
items $i$ with weight $w_i > \tau^{THR}$. Also, we store the total weight $U$ of all smaller items. We note that the set
$L$ is contained in $S^{THR}$. Now,

$$\sum_{i \in [n]} \min\{1, w_i/\tau^{THR}\} = |L| + U/\tau^{THR}$$

The items $i$ in $L$ are stored in a priority queue ordered not by priority $q_i$ but by weight $w_i$. When item $n$
arrives we do as follows. If $w_n > \tau^{THR}$, we add $n$ to $L$; otherwise we add its weight $w_n$ to $U$.

Next we increase $\tau^{THR}$ in an iterative process. Let $\tau^* = U/(k - |L|)$ and let $w_j$ be the smallest weight
in $L$. If $L$ is empty, $w_j = \infty$. If $\tau^* < w_j$, we set $\tau^{THR} = \tau^*$, and we are done. Otherwise, we set
$\tau^{THR} = w_j$, remove $j$ from $L$, add $w_j$ to $U$, and repeat.

In the above process, each item is inserted and deleted at most once from each priority queue. Also,
at any time, the expected size of each priority queue is at most $k$, so the total expected cost per item is
$O(\log k)$. Exploiting a floating point representation of priorities, this can be reduced to $O(\log \log k)$ time.
Thus we get the same time complexity as for priority sampling, but with a more complicated algorithm.
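The threshold maintenance can be sketched as below. This covers only the computation of $\tau^{THR}$ from $L$ and $U$, not the sample itself; the class name `ThresholdTracker` is hypothetical, and the handling of the start-up phase with at most $k$ items (where the threshold is kept at 0) is an assumption added to make the sketch total.

```python
import heapq


class ThresholdTracker:
    """Maintain tau with sum_i min(1, w_i / tau) = k over the weights seen so far."""

    def __init__(self, k):
        self.k = k
        self.tau = 0.0
        self.large = []          # min-heap of the weights above tau (the set L)
        self.small_total = 0.0   # total weight U of the remaining items

    def add(self, w):
        if w > self.tau:
            heapq.heappush(self.large, w)
        else:
            self.small_total += w
        while True:
            free = self.k - len(self.large)
            if free > 0:
                candidate = self.small_total / free      # tau* = U / (k - |L|)
            elif free == 0 and self.small_total == 0:
                candidate = 0.0                          # at most k items so far
            else:
                candidate = float("inf")                 # L is too big: shrink it
            if not self.large or candidate < self.large[0]:
                self.tau = candidate
                return self.tau
            # The smallest weight in L is not above the threshold: move it to U.
            self.small_total += heapq.heappop(self.large)


tracker = ThresholdTracker(2)
for w in [10.0, 1.0, 1.0, 1.0, 1.0]:
    tracker.add(w)
```

In this run the heavy item stays in $L$ and the four unit items share the remaining expected sample, so the threshold settles where $1 + 4/\tau = 2$.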



7.2 Relaxed priority sampling 

We will now bring down the time per item to constant for priority sampling. To do this, we relax the notion
of reservoir sampling, and set aside space for $2k + 2$ items. Instead of using a priority queue, we use a buffer
$B$ for up to $2k + 2$ items. The buffer is guaranteed to contain the $k + 1$ items of highest priority. When it gets
full, a cleanup is performed to reduce the occupancy down to $k + 1$. Using a standard selection algorithm
ID, we find the $(k + 1)$th highest priority in $B$, and all items of lower priority are deleted, all in time linear
in $k$. The cleaning is executed once for every $k + 1$ arrivals, hence at constant cost $O(1)$ per item processed.
After cleanup, we resume filling the buffer with fresh arrivals.
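The buffered variant can be sketched as follows. Note the swapped-in technique: for brevity the cleanup uses `heapq.nlargest`, which takes $O(k \log k)$ time, rather than a linear-time selection algorithm, so this sketch illustrates the buffering logic but not the constant amortized bound. The name `relaxed_top` and the `(priority, item)` entry format are assumptions.

```python
import heapq


def relaxed_top(entries, k):
    """Return the k + 1 entries of highest priority using a buffer of 2k + 2 slots."""
    capacity = 2 * k + 2
    buffer = []
    for entry in entries:                  # entry = (priority, item)
        buffer.append(entry)
        if len(buffer) == capacity:        # cleanup: keep only the k + 1 largest
            buffer = heapq.nlargest(k + 1, buffer)
    return heapq.nlargest(k + 1, buffer)   # final cleanup at the end of the period
```

After each cleanup the buffer holds $k + 1$ entries, so exactly $k + 1$ further arrivals fit before the next cleanup, which is what spreads the cleanup cost over the arrivals.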

A further modification processes every item in constant time without having to wait for the cleanup to
execute. Two buffers of capacity $2k + 2$ are used, one buffer being used for collection while the other is
cleaned down to $k + 1$ items. Then each item is processed in constant time, plus $O(k)$ time at the end of
the measurement period in order to find the $k + 1$ items of largest priority from the union of the contents
of the two buffers. Thus, provided the time between successive arrivals is bounded below by the $O(1)$
processing time per item, the processing associated with each flow can be completed before the next flow
arrives.

We note that similar ideas can be used to get constant processing time per item for weighted sampling 
with replacement and threshold sampling. 



8 Conclusions 

We have introduced priority sampling as a simple scheme for weight sensitive sampling without replacement that is
very effective for estimating subset sums.